The Role of Lexicalization and Pruning for Base Noun Phrase Grammars
Authors
Abstract
This paper explores the role of lexicalization and pruning of grammars for base noun phrase identification. We modify our original framework (Cardie & Pierce 1998) to extract lexicalized treebank grammars that assign a score to each potential noun phrase based upon both the part-of-speech tag sequence and the word sequence of the phrase. We evaluate the modified framework on the “simple” and “complex” base NP corpora of the original study. As expected, we find that lexicalization dramatically improves the performance of the unpruned treebank grammars; however, for the simple base noun phrase data set, the lexicalized grammar performs below the corresponding unlexicalized but pruned grammar, suggesting that lexicalization is not critical for recognizing very simple, relatively unambiguous constituents. Somewhat surprisingly, we also find that error-driven pruning improves the performance of the probabilistic, lexicalized base noun phrase grammars by up to 1.0% recall and 0.4% precision, and does so even using the original pruning strategy that fails to distinguish the effects of lexicalization. This result may have implications for many probabilistic grammar-based approaches to problems in natural language processing: error-driven pruning is a remarkably robust method for improving the performance of probabilistic and non-probabilistic grammars alike.

In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99)

Introduction

Base noun phrase identification (see Figure 1) is a critical component in many large-scale natural language processing (NLP) applications: it is among the first steps for many partial parsers; information retrieval systems rely on base noun phrases as a primary source of linguistic phrases for indexing; and base noun phrases support information extraction, a variety of text-mining operations, and distributional clustering techniques that attempt to relieve sparse data problems.
As a result, a number of researchers have targeted the problem of base noun phrase recognition (Church 1988; Bourigault 1992; Voutilainen 1993; Justeson & Katz 1995).

Copyright © 1999, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

[The survival] of [spinoff Cray Computer Corp.] as [a fledgling] in [the supercomputer business] appears to depend heavily on [the creativity] — and [longevity] — of [its chairman] and [chief designer], [Seymour Cray].
Base noun phrases: simple, nonrecursive noun phrases — noun phrases that do not contain other noun phrase descendants.
Figure 1: Base NP Examples

Only recently, however, have efforts in this area attempted the automatic acquisition of base noun phrase (base NP) parsers and their automatic evaluation on the same large test corpus: Ramshaw & Marcus (1998) applied transformation-based learning (Brill 1995); Argamon, Dagan, & Krymolowski (1998) devised a memory-based sequence learning (MBSL) method; previously we introduced error-driven pruning of treebank grammars (Cardie & Pierce 1998). All three methods for base NP recognition have been evaluated using part-of-speech tagged and base NP annotated corpora derived from the Penn Treebank (Marcus, Marcinkiewicz, & Santorini 1993), thus offering an opportunity for more direct comparison than algorithms that have been evaluated by hand.

Ramshaw & Marcus’s noun phrase bracketer learns a set of transformation rules. Each transformation locally updates the noun phrase bracketing associated with a single word based on nearby features, such as neighboring words, part-of-speech tags, and bracket boundaries. After the training phase, the learned transformations are applied, in order, to each novel text to identify base NPs. Argamon et al. develop a variation of memory-based learning (Stanfill & Waltz 1986) for base NP recognition. During training, their MBSL algorithm saves the entire raw training corpus.
Generalization of the implicit noun phrase rules in the training corpus occurs at application time — MBSL searches the novel text for tag sequences or combinations of tag subsequences (tiles) that occurred during training in a similar context.

Our corpus-based algorithm for noun phrase recognition uses a simpler representation for the base NP grammar, namely part-of-speech tag sequences. It extracts the tag sequence grammar from the treebank training corpus, then prunes it using an error-based benefit metric. To identify a base NP in a novel sentence, a simple longest-match bracketer scans input text from left to right, at each point selecting the longest sequence of tags matching a grammar rule (if any) to form a base NP. The approach has a number of attractive features: both the training procedure and the bracketer are very simple; the bracketer is very fast; and the learned grammar can be easily modified. Nonetheless, while the accuracy of the treebank approach is very good for applications that require or prefer fairly simple base NPs, it lags the alternative approaches when identifying more complex noun phrases.

This can be explained in part by examining the sources of knowledge employed by each method: the treebank approach uses neither lexical (i.e., word-based) information nor context; MBSL captures context in its tiles, but uses no lexical information; the transformation-based learner uses both lexical information and the surrounding context to make decisions about bracket boundary placement. Context and lexicalization have been shown to be important across a variety of natural language learning tasks; as a result, we might expect the treebank approach to improve with the addition of either. However, lexicalization and its accompanying transition to a probabilistic grammar has at least one goal similar to that of pruning: to reduce the effect of “noisy” rules in the grammar.
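The longest-match bracketing procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the grammar representation (a set of tag-sequence tuples) and the example rules are assumptions made for the sketch.

```python
def bracket_longest_match(tags, grammar):
    """Greedy left-to-right longest-match base NP bracketing.

    tags: list of part-of-speech tags for one sentence.
    grammar: set of tag-sequence tuples accepted as base NPs.
    Returns a list of (start, end) spans, with end exclusive.
    """
    spans = []
    i, n = 0, len(tags)
    while i < n:
        # Try the longest candidate span starting at i first.
        for j in range(n, i, -1):
            if tuple(tags[i:j]) in grammar:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1  # no rule matches here; skip this token
    return spans

# Illustrative grammar: a few tag sequences one might extract from a treebank.
grammar = {("DT", "NN"), ("DT", "JJ", "NN"), ("NNP", "NNP"), ("NN",)}
tags = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]
print(bracket_longest_match(tags, grammar))  # [(0, 3), (4, 6)]
```

Because the scan is greedy and never backtracks, bracketing is linear in sentence length for a fixed maximum rule length, which is one source of the speed noted above.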
Therefore, it is not clear that lexicalization alone will improve the error-driven pruning treebank approach to base noun phrase recognition.

This paper explores the role of lexicalization and pruning of base noun phrase grammars. More specifically, we modify our original framework to extract lexicalized treebank grammars that assign a score to each potential noun phrase based upon both the tag sequence and the word sequence of the phrase. In addition, we extend the noun phrase bracketer to select the combination of brackets with the highest score. We evaluate the modified framework on the “simple” and “complex” base NP corpora of the original study. As expected, we find that lexicalization dramatically improves the performance of the unpruned treebank grammars, with an increase in precision and recall of approximately 70% and 50%, respectively. However, for the simple base NP data set, the lexicalized grammar still performs below the unlexicalized, but pruned, grammar of the original base NP study, suggesting that lexicalization is not critical for recognizing very simple, relatively unambiguous constituents. For the simple base NP task, pruning serves much the same function as lexicalization in that both suppress the application of bad rules. Pruning, however, allows the simplicity of the grammar and bracketing procedure to remain intact. In contrast, for more complex base NPs, the lexicalized grammar performs comparably to the pruned unlexicalized grammar of the original study. Thus, the importance of lexical information appears to increase with the complexity of the linguistic task: more than just the tag sequence is needed to determine the quality of a candidate phrase when the targets are more ambiguous. For many applications, however, the added complexity of the lexicalized approach may not be worth the slight increase in performance.
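In the modified framework, each candidate phrase receives a score from both its tag sequence and its word sequence, and the bracketer selects the highest-scoring combination of non-overlapping brackets. The sketch below shows only the selection step, as a simple dynamic program over scored candidate spans; the candidate spans and scores are hypothetical, and the paper's actual scoring function is not reproduced here.

```python
def best_bracketing(candidates, n):
    """Select a non-overlapping set of candidate spans maximizing total score.

    candidates: dict mapping (start, end) spans to scores (higher is better).
    n: sentence length in tokens.
    """
    best = [0.0] * (n + 1)   # best[j]: max total score over tokens [0, j)
    back = [None] * (n + 1)  # span chosen to end at position j, if any
    for j in range(1, n + 1):
        best[j], back[j] = best[j - 1], None  # option: leave token j-1 unbracketed
        for (s, e), score in candidates.items():
            if e == j and best[s] + score > best[j]:
                best[j], back[j] = best[s] + score, (s, e)
    # Recover the chosen spans by walking back from the end of the sentence.
    spans, j = [], n
    while j > 0:
        if back[j] is None:
            j -= 1
        else:
            spans.append(back[j])
            j = back[j][0]
    return list(reversed(spans))

# Hypothetical scored candidates for a 6-token sentence.
candidates = {(0, 3): 2.5, (0, 2): 1.0, (4, 6): 1.8, (3, 5): 1.2}
print(best_bracketing(candidates, 6))  # [(0, 3), (4, 6)]
```

Note how this differs from the greedy longest-match procedure: a shorter but higher-scoring span can displace a longer one, which is exactly the flexibility lexicalized scoring buys.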
There were a couple of surprises in our results: both the longest-match heuristic of the original bracketer and the original error-driven pruning strategy improved the performance of the lexicalized base NP grammars. First, the longest-match heuristic proved to be more useful than lexical information for the unpruned grammar on both corpora. Second, we found that pruning improves bracketing by up to 1.0% recall and 0.4% precision, and does so even using the original strategy that fails to distinguish the effects of lexicalized rules. This result may have implications for many probabilistic grammar-based approaches to problems in natural language processing: error-driven pruning is a remarkably robust method for improving the performance of probabilistic and non-probabilistic grammars alike.

The next section of this paper defines base noun phrases and reviews the basic framework used to extract, prune, and apply grammars using a treebank base NP corpus. The following section extends the framework to include lexicalized grammars. We then evaluate the modified framework and conclude with a discussion and comparison of approaches to base noun phrase identification.

The Treebank Approach to Base Noun Phrase Identification

In this work we define base NPs to be simple, nonrecursive noun phrases — noun phrases that do not contain other noun phrase descendants. The bracketed portions of Figure 1, for example, show the base NPs in one sentence from the Penn Treebank Wall Street Journal corpus. Thus, the string the survival of spinoff Cray Computer Corp. as a fledgling in the supercomputer business is too complex to be a base NP; instead, it contains four simpler noun phrases, each of which is considered a base NP: the survival, spinoff Cray Computer Corp., a fledgling, and the supercomputer business. This section reviews the treebank approach to base noun phrase identification depicted in Figure 2. For more detail, see Cardie & Pierce (1998).
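Error-driven pruning can be illustrated with a deliberately simplified benefit metric: a rule is discarded when removing it does not lower the correct-minus-incorrect bracket count on a pruning corpus. The sketch below makes that assumption, with a toy grammar, a toy corpus, and a compact longest-match bracketer; the paper's actual benefit metric and pruning procedure are not reproduced here.

```python
def longest_match(tags, grammar):
    """Minimal greedy longest-match bracketer; spans are (start, end)."""
    spans, i, n = [], 0, len(tags)
    while i < n:
        for j in range(n, i, -1):
            if tuple(tags[i:j]) in grammar:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1
    return spans

def corpus_score(grammar, corpus):
    """Correct minus incorrect brackets over (tags, gold_spans) pairs."""
    score = 0
    for tags, gold in corpus:
        predicted = longest_match(tags, grammar)
        correct = sum(1 for span in predicted if span in gold)
        score += correct - (len(predicted) - correct)
    return score

def prune(grammar, corpus):
    """Drop each rule whose removal does not hurt the corpus score,
    i.e. whose benefit (score with the rule minus score without it) is <= 0."""
    grammar = set(grammar)
    for rule in sorted(grammar):  # deterministic order for the sketch
        without = grammar - {rule}
        if corpus_score(without, corpus) >= corpus_score(grammar, corpus):
            grammar = without
    return grammar

# A "noisy" rule ("NN", "VBZ") would bracket a verb into the NP;
# pruning removes it because it contributes no net benefit.
grammar = {("DT", "NN"), ("NN", "VBZ")}
corpus = [(["DT", "NN", "VBZ", "DT", "NN"], {(0, 2), (3, 5)})]
print(sorted(prune(grammar, corpus)))  # [('DT', 'NN')]
```

Because the metric only asks whether each rule helps or hurts on held-out bracketings, it applies equally well whether or not the rules carry probabilities, which is consistent with the robustness of pruning reported above.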